Since Jan. 1, 2015, The Washington Post has been compiling a database of every fatal shooting in the US by a police officer in the line of duty.

While there are many challenges in data collection and reporting, The Washington Post has been tracking more than a dozen details about each killing. These include the race, age and gender of the deceased, whether the person was armed, and whether the victim was experiencing a mental-health crisis. The Post has gathered this supplemental information from law enforcement websites, local news reports, and social media, and by monitoring independent databases such as "Killed by Police" and "Fatal Encounters". The Post has also conducted additional reporting in many cases.
An additional dataset of US census data on racial demographics is used as well.
The objective of this analysis is to examine the argument that police officers are racially biased, and I hope it provides a thoughtful, fact-based analysis of this important issue.
The original analysis by The Washington Post is available here.
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
import matplotlib.pyplot as plt
import seaborn as sns
pio.templates.default = "simple_white"
pd.options.display.float_format = '{:,.2f}'.format
df_share_race_city = pd.read_csv('Share_of_Race_By_City.csv', encoding="windows-1252")
df_fatalities = pd.read_csv('Deaths_by_Police_US.csv', encoding="windows-1252")
column_list = ["share_white", "share_black", "share_native_american", "share_asian", "share_hispanic"]
for column in column_list:
    df_share_race_city[column] = pd.to_numeric(df_share_race_city[column], errors="coerce")
df_share_race_city.duplicated().any()
df_share_race_city
| | Geographic area | City | share_white | share_black | share_native_american | share_asian | share_hispanic |
|---|---|---|---|---|---|---|---|
| 0 | AL | Abanda CDP | 67.20 | 30.20 | 0.00 | 0.00 | 1.60 |
| 1 | AL | Abbeville city | 54.40 | 41.40 | 0.10 | 1.00 | 3.10 |
| 2 | AL | Adamsville city | 52.30 | 44.90 | 0.50 | 0.30 | 2.30 |
| 3 | AL | Addison town | 99.10 | 0.10 | 0.00 | 0.10 | 0.40 |
| 4 | AL | Akron town | 13.20 | 86.50 | 0.00 | 0.00 | 0.30 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 29263 | WY | Woods Landing-Jelm CDP | 95.90 | 0.00 | 0.00 | 2.10 | 0.00 |
| 29264 | WY | Worland city | 89.90 | 0.30 | 1.30 | 0.60 | 16.60 |
| 29265 | WY | Wright town | 94.50 | 0.10 | 1.40 | 0.20 | 6.20 |
| 29266 | WY | Yoder town | 97.40 | 0.00 | 0.00 | 0.00 | 4.00 |
| 29267 | WY | Y-O Ranch CDP | 92.80 | 1.50 | 2.60 | 0.00 | 11.80 |
29268 rows × 7 columns
We have 20 cities with missing values for the race shares. We could try to fill in this missing information, but for now we simply keep it in mind during the analysis.
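The missing-value count can be verified with a boolean mask. A minimal sketch on a toy frame (the `"(X)"` placeholder strings are hypothetical non-numeric values standing in for whatever the real CSV contains; column names follow the dataset):

```python
import pandas as pd

# Toy frame mimicking the census-share file: non-numeric placeholders
# become NaN after the same to_numeric coercion used above.
toy = pd.DataFrame({
    "City": ["A town", "B city", "C CDP"],
    "share_white": ["90.1", "(X)", "55.0"],
    "share_black": ["5.0", "(X)", "40.0"],
})
for col in ["share_white", "share_black"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Rows where any share is missing:
missing_rows = toy[toy[["share_white", "share_black"]].isna().any(axis=1)]
print(len(missing_rows))  # 1
```

The same `isna().any(axis=1)` mask applied to `df_share_race_city` gives the 20 cities mentioned above.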
df_fatalities["date"] = pd.to_datetime(df_fatalities["date"])
race_dict = {"W": "White",
"B": "Black",
"H": "Hispanic",
"A": "Asian",
"N": "Native American",
"O": "Other",
}
df_fatalities.race = df_fatalities.race.map(race_dict)
df_fatalities.duplicated().any()
df_fatalities
| | id | name | date | manner_of_death | armed | age | gender | race | city | state | signs_of_mental_illness | threat_level | flee | body_camera | longitude | latitude | is_geocoding_exact |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Tim Elliot | 2015-01-02 | shot | gun | 53.00 | M | Asian | Shelton | WA | True | attack | Not fleeing | False | -123.12 | 47.25 | True |
| 1 | 4 | Lewis Lee Lembke | 2015-01-02 | shot | gun | 47.00 | M | White | Aloha | OR | False | attack | Not fleeing | False | -122.89 | 45.49 | True |
| 2 | 5 | John Paul Quintero | 2015-01-03 | shot and Tasered | unarmed | 23.00 | M | Hispanic | Wichita | KS | False | other | Not fleeing | False | -97.28 | 37.70 | True |
| 3 | 8 | Matthew Hoffman | 2015-01-04 | shot | toy weapon | 32.00 | M | White | San Francisco | CA | True | attack | Not fleeing | False | -122.42 | 37.76 | True |
| 4 | 9 | Michael Rodriguez | 2015-01-04 | shot | nail gun | 39.00 | M | Hispanic | Evans | CO | False | attack | Not fleeing | False | -104.69 | 40.38 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6805 | 7422 | NaN | 2021-12-11 | shot | unknown weapon | NaN | M | NaN | Elizabethtown | KY | False | undetermined | NaN | False | NaN | NaN | True |
| 6806 | 7424 | Robert Lee Engle | 2021-12-11 | shot | undetermined | 64.00 | M | NaN | Frankfort | KY | False | undetermined | NaN | False | NaN | NaN | True |
| 6807 | 7418 | Patrick Horton | 2021-12-12 | shot | gun | 39.00 | M | NaN | Cleveland | OH | False | attack | Not fleeing | False | NaN | NaN | True |
| 6808 | 7427 | NaN | 2021-12-12 | shot | gun | NaN | M | NaN | Ferguson | MO | False | attack | NaN | False | NaN | NaN | True |
| 6809 | 7431 | George Hollins | 2021-12-12 | shot | gun | 26.00 | M | NaN | Jennings | MO | False | other | Not fleeing | False | NaN | NaN | True |
6810 rows × 17 columns
We have some missing values in the name, armed, age, race, flee, longitude and latitude columns; we keep this in mind during the relevant parts of the exploration. We also map the single-letter race codes to full race names.
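One caveat about the mapping step: `Series.map` returns NaN for any code not present in the dictionary, so unexpected race codes are silently turned into missing values. A minimal illustration (`"Z"` is a hypothetical unknown code, not one that appears in the dataset):

```python
import pandas as pd

race_dict = {"W": "White", "B": "Black", "H": "Hispanic",
             "A": "Asian", "N": "Native American", "O": "Other"}
codes = pd.Series(["W", "B", "Z"])  # "Z" is a hypothetical unknown code
mapped = codes.map(race_dict)
print(mapped.isna().sum())  # the unmapped code becomes NaN
```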
df_race = df_fatalities.race.value_counts().to_frame(name="killed")
df_race.index.name = "race"
df_race["pct_in_population"] = pd.Series({"White": 54.6,
"Black": 14.9,
"Hispanic": 17.1,
"Asian": 8.2,
"Native American": 2.8,
"Other": 2.4})
total_population = 331449281
df_race["population"] = total_population/100*df_race.pct_in_population
df_race["relative_killed_per_mil"] = df_race.killed.values/df_race.population.values*1000000
df_race
| race | killed | pct_in_population | population | relative_killed_per_mil |
|---|---|---|---|---|
| White | 2970 | 54.60 | 180,971,307.43 | 16.41 |
| Black | 1557 | 14.90 | 49,385,942.87 | 31.53 |
| Hispanic | 1084 | 17.10 | 56,677,827.05 | 19.13 |
| Asian | 106 | 8.20 | 27,178,841.04 | 3.90 |
| Native American | 91 | 2.80 | 9,280,579.87 | 9.81 |
| Other | 47 | 2.40 | 7,954,782.74 | 5.91 |
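As a sanity check on the per-million rate formula, the White row can be recomputed by hand (numbers taken from the table above):

```python
total_population = 331_449_281
pct_white = 54.6
killed_white = 2970

# population share -> absolute population -> killings per million residents
population_white = total_population / 100 * pct_white
rate_per_mil = killed_white / population_white * 1_000_000
print(round(rate_per_mil, 2))  # 16.41, matching the table
```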
Let's do some initial analysis.
gender = df_fatalities.gender.value_counts()
gender.index=["MEN ","WOMEN "]
gender_bar = px.bar(y=gender.index,
x=gender.values,
orientation="h",
labels={"y":"", "x":"count"},
title="Killings by gender.",
color=gender.index,
color_discrete_sequence=["#424642","#C06014"],
width=1000,
height=250)
gender_bar.update_layout(title_font_size=17)
gender_bar.update_traces(showlegend=False)
gender_bar.update_yaxes(title_font_size=17,
showline=False,
tickfont_size=13,
ticks="")
gender_bar.update_xaxes(title_font_size=17,
showline=False,
ticks="")
gender_bar.show()
age = px.histogram(x=df_fatalities.age,
labels={"x":"age"},
title="Age distribution.",
color_discrete_sequence=["#C06014"],
width=900,
height=500,
nbins=100,
marginal="box")
age.update_layout(title_font_size=17)
age.update_yaxes(title_font_size=17,
tickfont_size=13)
age.update_xaxes(title_font_size=17)
age.show()
Percentage of people killed who were under 30 years old:
df_fatalities[df_fatalities.age < 30].id.count()/df_fatalities.id.count()*100
30.440528634361236
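An aside on the denominator (not part of the original analysis): `age` contains missing values, and dividing by `id.count()` keeps those rows in the denominator, so the figure slightly understates the share among victims whose age is known. A toy illustration:

```python
import numpy as np
import pandas as pd

ages = pd.Series([25, 35, np.nan, 28])

# NaN rows stay in the denominator vs. restricting to known ages
share_all_rows = (ages < 30).sum() / len(ages) * 100
share_known_age = (ages < 30).sum() / ages.notna().sum() * 100
print(share_all_rows, share_known_age)  # 50.0 vs ~66.7
```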
Insights:
df_armed = df_fatalities.groupby("armed").armed.count()
df_armed_without = df_armed[df_armed.index != "unarmed"].sort_values(ascending=False)
armed_no = df_armed[df_armed.index=="unarmed"].values.sum()
armed_yes = df_armed[df_armed.index!="unarmed"].values.sum()
armed = px.pie(names=["armed", "unarmed"],
labels=["armed", "unarmed"],
values=[armed_yes, armed_no],
color_discrete_sequence=["#424642","#899292"],
title="Were people armed?",
width=700,
height=500,
)
armed.update_layout(title_font_size=17)
armed.update_traces(pull = [0.1, 0],
textposition='auto',
textinfo='percent+label',
textfont_size=14,
insidetextorientation='auto',
showlegend=False,
rotation=-30)
armed.show()
weapon = px.pie(names=df_armed_without[df_armed_without.values>20].index,
labels=df_armed_without[df_armed_without.values>20].index,
values=df_armed_without[df_armed_without.values>20].values,
color_discrete_sequence=["#424642","#656765","#848684","#848684","#848684","#848684","#899292","#899292","#899292","#899292","#899292"],
title="Weapons.",
width=700,
height=500,
hole=0.5)
weapon.update_layout(title_font_size=17)
weapon.update_traces(pull=[0,0.065,0.15,0.15,0.15,0.15,0.17,0.17,0.17,0.17,0.17],
textposition='auto',
textfont_size=14,
insidetextorientation='auto',
rotation=-30,
textinfo="value")
weapon.show()
Insights:
df_race = df_race.sort_values(by="relative_killed_per_mil", ascending=False)
widths = np.array(df_race.population/10000000)
race = px.bar(x=np.cumsum(widths)-widths,
y=df_race.relative_killed_per_mil,
text=df_race.index,
title="Deaths by race relative to the racial distribution of the US population.",
labels={"x":"population in millions", "y":"killings per million"})
race.update_layout(width=800,
height=500,
title_font_size=17)
race.update_traces(width=widths,
offset=0,
textposition="outside",
textangle=0,
textfont_color="black",
textfont_size=13,
constraintext="none",
marker_color=["#C06014","#E6E8DE","#E6E8DE","#E6E8DE","#E6E8DE","#E6E8DE"])
race.update_xaxes(tickmode="array",
tickvals=np.cumsum(widths)-widths/2,
ticktext=round(df_race.population/1000000),
showline=False,
ticks="",
title_font_size=17)
race.update_yaxes(tickfont_size=12,
title_font_size=17,
showline=False,
ticks="")
race.show()
The chi-square test provides a way to investigate differences between distributions of categorical variables over the same categories, and to test for dependence between categorical variables. Let's examine whether the distribution of race in fatal police shootings differs from the distribution of race in the US population.
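The shape of such a test, on hypothetical toy counts: observed counts are compared with expected counts derived from population proportions, with the expected counts scaled so both sum to the same total.

```python
from scipy.stats import chisquare

observed = [30, 50, 20]        # hypothetical category counts
proportions = [0.5, 0.3, 0.2]  # hypothetical population shares

# Expected counts under the null: proportions scaled to the observed total
expected = [p * sum(observed) for p in proportions]  # [50.0, 30.0, 20.0]

result = chisquare(f_obs=observed, f_exp=expected)
print(result.statistic)  # (30-50)^2/50 + (50-30)^2/30 + 0 = 21.33...
```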
from scipy.stats import chisquare, chi2
chi_table = df_race.loc[:,["killed", "pct_in_population"]]
chi_table["expected"] = (chi_table.killed.sum())/100*chi_table.pct_in_population
chi_table
| race | killed | pct_in_population | expected |
|---|---|---|---|
| Black | 1557 | 14.90 | 872.39 |
| Hispanic | 1084 | 17.10 | 1,001.21 |
| White | 2970 | 54.60 | 3,196.83 |
| Native American | 91 | 2.80 | 163.94 |
| Other | 47 | 2.40 | 140.52 |
| Asian | 106 | 8.20 | 480.11 |
# Note: ddof is *subtracted* from k - 1 when computing the degrees of
# freedom, so ddof=5 leaves 0 degrees of freedom and a NaN p-value.
# The default ddof=0 would give df = 6 - 1 = 5; we compute the p-value
# manually below.
chisquare(f_obs=chi_table.killed,
          f_exp=chi_table.expected,
          ddof=5)
Power_divergenceResult(statistic=946.3852423233955, pvalue=nan)
chi_square_stat = (((chi_table.killed-chi_table.expected)**2)/chi_table.expected).sum()
p_value = 1 - chi2.cdf(x=chi_square_stat, df=5)
print(f"Chi-square statistic: {chi_square_stat}, p-value: {p_value}")
Chi-square statistic: 946.3852423233955, p-value: 0.0
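A side note on the p-value of 0.0: `1 - chi2.cdf(x, df)` underflows to exactly zero once the tail probability drops below machine epsilon, whereas `chi2.sf` computes the tail directly and can return a tiny but nonzero value. A quick check with the statistic from above:

```python
from scipy.stats import chi2

stat, df = 946.385, 5
p_subtraction = 1 - chi2.cdf(stat, df)  # underflows to 0.0
p_direct = chi2.sf(stat, df)            # tiny but representable tail probability
print(p_subtraction, p_direct)
```

Either way, the p-value is vanishingly small and the conclusion is unchanged.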
Insights:
One might argue that the higher proportion of Black people killed simply reflects the fact that more Black people live in the areas in question. Let's examine this claim.
grouped_share = df_share_race_city.groupby("Geographic area").mean(numeric_only=True)
grouped_share = pd.DataFrame(grouped_share.stack()).reset_index(level=1, drop=False)
grouped_share.columns = ["Race", "Rate"]
top_black_index = grouped_share[grouped_share.Race == "share_black"].sort_values(by="Rate", ascending=False).head(10)[::-1].index
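The groupby-stack reshape above can be illustrated on a toy frame (hypothetical state codes and shares): stacking turns each share column into its own row, giving one (state, race) pair per row.

```python
import pandas as pd

wide = pd.DataFrame({"share_white": [90.0, 80.0],
                     "share_black": [5.0, 15.0]},
                    index=["AA", "BB"])  # hypothetical state codes

# Wide -> long: the column level becomes a "Race" column
long = pd.DataFrame(wide.stack()).reset_index(level=1, drop=False)
long.columns = ["Race", "Rate"]
print(long)
```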
country_bar = px.bar(x=grouped_share.loc[top_black_index,:].Rate,
y=grouped_share.loc[top_black_index,:].index,
color=grouped_share.loc[top_black_index,:].Race,
orientation="h",
title="Racial makeup of the US states with the highest percentage of Black people.",
labels={"color":"", "y":"states","x":"percentage"},
color_discrete_sequence=["#BDBFB6", "#424642", "#536162", "#D7894A", "#C06014"])
country_bar.update_layout(width=1000,
height=600,
barnorm="percent",
title_font_size=17,
legend=dict(
orientation="h",
yanchor="middle",
y=1.04,
xanchor="center",
x=0.5))
country_bar.update_yaxes(tickfont_size=14,
title_font_size=17,
showline=False)
country_bar.update_xaxes(tickfont_size=12,
title_font_size=17,
showline=False,
ticks="")
country_bar.show()
Insights:
states = df_fatalities.state.value_counts().to_frame(name="killings")
populations={
"CA":39538223,
"TX":29145505,
"FL":21538187,
"NY":20201249,
"PA":13002700,
"IL":12801989,
"OH":11799448,
"GA":10711908,
"NC":10439388,
"MI":10077331,
"NJ":9288994,
"VA":8631393,
"WA":7705281,
"AZ":7151502,
"MA":7029917,
"TN":6910840,
"IN":6785528,
"MD":6177224,
"MO":6154913,
"WI":5893718,
"CO":5773714,
"MN":5706494,
"SC":5118425,
"AL":5024279,
"LA":4657757,
"KY":4505836,
"OR":4237256,
"OK":3959353,
"CT":3605944,
"UT":3205958,
"IA":3271616,
"NV":3104614,
"AR":3011524,
"MS":2961279,
"KS":2937880,
"NM":2117522,
"NE":1961504,
"ID":1839106,
"WV":1793716,
"HI":1455271,
"NH":1377529,
"ME":1362359,
"RI":1097379,
"MT":1084225,
"DE":989948,
"SD":886667,
"ND":779094,
"AK":733391,
"DC":689545,
"VT":643077,
"WY":576851,
}
states["population"]=states.index.map(populations)
states["relative_kill_per_mil"]=states.killings/states.population*1000000
choro = px.choropleth(data_frame=states,
locationmode="USA-states",
scope="usa",
locations=states.index,
color=states.relative_kill_per_mil,
color_continuous_scale=px.colors.sequential.Oranges,
range_color=[0,40],
title="Police killings per million residents by US state.",
hover_name=states.index,
)
choro.update_layout(title_font_size=17,
width=900,
height=600,
margin={"r":150, "l":0})
choro.update_traces(showscale=True,
marker_line_color='white',
colorbar_bordercolor="white")
choro.show()
Insights:
Because the most recent records may still be incomplete, we exclude the last two months from the dataset.
df_fatalities_date = df_fatalities.set_index("date")
df_fatalities_by_month = df_fatalities_date.resample("M").name.count()[:-2]
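As a quick sanity check of the resampling-and-trim step on a toy series (hypothetical dates, not the real data): `resample("M")` buckets events by calendar month, and the `[:-2]` slice drops the last two, possibly incomplete, months.

```python
import pandas as pd

dates = pd.to_datetime(["2021-01-05", "2021-01-20", "2021-02-10", "2021-03-01"])
events = pd.Series(1, index=dates)

monthly = events.resample("M").count()  # Jan: 2, Feb: 1, Mar: 1
trimmed = monthly[:-2]                  # keeps only January
print(trimmed.tolist())  # [2]
```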
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_fatalities_by_month.index,
y=df_fatalities_by_month.values,
line={"color":"#C06014"}))
fig.update_layout(title="Number of killings over time (2015-2021)",
xaxis_title="year",
yaxis_title="count",
width=900,
height=500,
title_font_size=17)
fig.update_yaxes(tickfont_size=12,
title_font_size=17)
fig.update_xaxes(tickfont_size=14,
title_font_size=17)
fig.add_annotation(x="2018-09-30",
y=55,
text="Amber Guyger case",
font={"size":13},
showarrow=True)
fig.show()
Insights:
People have been shot and killed in encounters with officers even when they were unarmed or carrying a toy weapon. Such killings seem unnecessary, but further understanding and investigation of each case is required. An officer shooting a member of a minority group does not by itself imply racial bias, but, as our analysis showed, bias is not absent. In recent years, the possibility that police officers are racially biased was the primary motivation for the creation of the socially relevant and internationally discussed #BlackLivesMatter movement. We see this analysis as a starting point for understanding the truth behind possible racial bias in the population at large.